feat: add Confluence metadata extractor #515

Merged
ravisuhag merged 2 commits into main from feat/confluence-extractor
Apr 18, 2026

Conversation

@ravisuhag
Member

Summary

  • Adds a new Confluence extractor that extracts page metadata and relationships from Confluence spaces via the REST API v2
  • Emits space and document entities with belongs_to, child_of, owned_by, and documented_by edges
  • Scans page content for URN references to auto-link documentation to data assets
  • Supports filtering by space keys and excluding specific spaces
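
The include/exclude behaviour in the last bullet can be sketched as follows. This is a minimal illustration, not the extractor's code: `Space`, `filterSpaces`, and `toSet` are hypothetical names, and the real extractor may apply the include list on the API side (the walkthrough shows the keys being passed to `GetSpaces`) rather than in memory.

```go
package main

import "fmt"

// Space is a hypothetical, trimmed-down stand-in for the extractor's space type.
type Space struct {
	ID  string
	Key string
}

// toSet turns a list of space keys into a lookup set.
func toSet(keys []string) map[string]bool {
	m := make(map[string]bool, len(keys))
	for _, k := range keys {
		m[k] = true
	}
	return m
}

// filterSpaces keeps spaces whose key is in include (or every space when
// include is empty) and then drops any space whose key is in exclude.
func filterSpaces(spaces []Space, include, exclude []string) []Space {
	inc := toSet(include)
	exc := toSet(exclude)
	var out []Space
	for _, s := range spaces {
		if len(inc) > 0 && !inc[s.Key] {
			continue // not in the requested set
		}
		if exc[s.Key] {
			continue // explicitly excluded
		}
		out = append(out, s)
	}
	return out
}

func main() {
	spaces := []Space{
		{ID: "1", Key: "ENG"},
		{ID: "2", Key: "ARCHIVE"},
		{ID: "3", Key: "DATA"},
	}
	// Excluding ARCHIVE leaves ENG and DATA.
	for _, s := range filterSpaces(spaces, nil, []string{"ARCHIVE"}) {
		fmt.Println(s.Key)
	}
}
```

In this sketch exclusion wins over inclusion, so a key listed in both is dropped; whether the extractor makes the same choice is not stated in the PR.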

Details

New files:

  • plugins/extractors/confluence/confluence.go — Main extractor with Config, Init, Extract
  • plugins/extractors/confluence/client.go — HTTP client for Confluence REST API v2 (spaces, pages, labels, cursor-based pagination)
  • plugins/extractors/confluence/confluence_test.go — 6 unit tests covering config validation, extraction, edges, URN detection, exclusion
  • plugins/extractors/confluence/README.md — Documentation
  • test/e2e/confluence_file/confluence_file_test.go — End-to-end test with mock server through full pipeline

Entities emitted:

| Type | Description |
| --- | --- |
| space | Confluence space metadata |
| document | Page metadata (title, labels, version, timestamps) |

Edges emitted:

| Type | Source → Target | Description |
| --- | --- | --- |
| belongs_to | document → space | Page belongs to a space |
| child_of | document → document | Page hierarchy |
| owned_by | document → user | Page author |
| documented_by | document → any | URN references found in page content |
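
The `documented_by` detection could look roughly like the sketch below. The regular expression is an assumption for illustration (the PR does not show the actual pattern), and `findURNs` is a hypothetical helper: it scans a page body for `urn:`-prefixed, colon-separated identifiers and returns each distinct one once, in order of first appearance.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// urnPattern is an assumed pattern, not the one used in the PR: "urn:"
// followed by a service name and at least two more colon-separated segments.
var urnPattern = regexp.MustCompile(`urn:[A-Za-z0-9_-]+(?::[A-Za-z0-9_./-]+){2,}`)

// findURNs returns the distinct URN-like strings found in body.
func findURNs(body string) []string {
	seen := make(map[string]bool)
	var urns []string
	for _, m := range urnPattern.FindAllString(body, -1) {
		// A sentence-ending period is matched by the segment class; trim it.
		m = strings.TrimRight(m, ".")
		if !seen[m] {
			seen[m] = true
			urns = append(urns, m)
		}
	}
	return urns
}

func main() {
	body := `<p>See urn:bigquery:proj:table:ds.orders and urn:kafka:prod:topic:events.</p>`
	for _, u := range findURNs(body) {
		fmt.Println(u)
	}
}
```

Each returned URN would become the target of one `documented_by` edge whose source is the page's own URN, following the direction given in the table.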

Closes #503 (Confluence portion)

Test plan

  • Unit tests pass (go test -tags plugins ./plugins/extractors/confluence/)
  • E2E test passes (go test -tags integration ./test/e2e/confluence_file/)
  • go build ./... succeeds
  • Review edge types and entity properties for consistency with existing extractors

Extract page metadata and relationships from Confluence spaces via the
REST API v2. Emits space and document entities with belongs_to, child_of,
owned_by, and documented_by edges. Scans page content for URN references
to auto-link documentation to data assets.
@vercel

vercel Bot commented Apr 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| meteor | Ready | Preview, Comment | Apr 18, 2026 10:47pm |

@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Warning

Rate limit exceeded

@ravisuhag has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 51 minutes and 2 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 51 minutes and 2 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d8c9d6b-4c8a-46b1-83d4-cddae3e801d8

📥 Commits

Reviewing files that changed from the base of the PR and between fca4a24 and a41cb6b.

📒 Files selected for processing (3)
  • plugins/extractors/confluence/client.go
  • plugins/extractors/confluence/confluence.go
  • plugins/extractors/confluence/confluence_test.go
📝 Walkthrough

Walkthrough

A new Confluence extractor plugin was added to extract metadata and relationships from Confluence spaces and pages via the REST API v2. The implementation includes a REST client (client.go) for API interactions, core extractor logic (confluence.go) that retrieves spaces and pages, extracts document metadata, detects embedded URN references through regex scanning, and emits space and document records with relationship edges (belongs_to, child_of, owned_by, documented_by). Supporting tests and documentation were also added, along with plugin registration in the extractors populate file.

Sequence Diagram

sequenceDiagram
    participant E as Extract Flow
    participant C as Confluence Client
    participant API as Confluence REST API v2
    participant Emit as Record Emitter

    E->>C: GetSpaces(ctx, keys)
    C->>API: GET /spaces (with pagination via cursor)
    API-->>C: Spaces list
    C-->>E: []Space

    loop For each space (not excluded)
        E->>Emit: Emit space record
        E->>C: GetPages(ctx, spaceID)
        C->>API: GET /spaces/{id}/pages (cursor pagination, storage format)
        API-->>C: Pages list
        C-->>E: []Page

        loop For each page
            E->>C: GetPageLabels(ctx, pageID)
            C->>API: GET /pages/{id}/labels
            API-->>C: Labels
            C-->>E: []Label
            
            E->>E: Extract metadata, scan body for URNs
            E->>Emit: Emit document record
            E->>Emit: Emit belongs_to edge (space)
            E->>Emit: Emit child_of edge (parent page if exists)
            E->>Emit: Emit owned_by edge (author)
            E->>Emit: Emit documented_by edges (per detected URN)
        end
    end
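
The cursor pagination that the diagram attributes to the `GET /spaces` and `GET /spaces/{id}/pages` calls can be sketched independently of HTTP. Everything here is hypothetical scaffolding: `page`, `fetchFunc`, and `fetchAll` are illustrative names, and the sketch assumes the response exposes the next batch as a relative link carrying a `cursor` query parameter.

```go
package main

import (
	"fmt"
	"net/url"
)

// page is a hypothetical decoded response: one batch of results plus the
// relative link to the next batch (empty when there is none).
type page struct {
	Results []string
	Next    string
}

// fetchFunc stands in for the HTTP call; cursor is empty on the first request.
type fetchFunc func(cursor string) (page, error)

// fetchAll keeps requesting batches until no next cursor is returned and
// accumulates every result into a single slice.
func fetchAll(fetch fetchFunc) ([]string, error) {
	var all []string
	cursor := ""
	for {
		p, err := fetch(cursor)
		if err != nil {
			return nil, fmt.Errorf("fetch page: %w", err)
		}
		all = append(all, p.Results...)
		if p.Next == "" {
			return all, nil
		}
		next, err := url.Parse(p.Next)
		if err != nil {
			return nil, fmt.Errorf("parse next link: %w", err)
		}
		cursor = next.Query().Get("cursor")
		if cursor == "" {
			return all, nil // no cursor to follow, stop rather than loop forever
		}
	}
}

func main() {
	// Two fake batches keyed by the cursor that requests them.
	batches := map[string]page{
		"":   {Results: []string{"a", "b"}, Next: "/wiki/api/v2/spaces?cursor=c1"},
		"c1": {Results: []string{"c"}},
	}
	all, err := fetchAll(func(cursor string) (page, error) {
		return batches[cursor], nil
	})
	fmt.Println(all, err)
}
```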
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: add Confluence metadata extractor' clearly describes the main change: adding a new Confluence extractor for metadata extraction. |
| Description check | ✅ Passed | The description provides comprehensive details about the new Confluence extractor, including objectives, entities, edges, test coverage, and linked issues. |
| Linked Issues check | ✅ Passed | The PR fully implements all coding requirements from issue #503: extracts page metadata (title, space, author, labels), page hierarchy relationships, emits documented_by edges via URN scanning, and supports space filtering/exclusion. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the Confluence extractor specified in issue #503. The plugin registration, client implementation, extractor logic, and comprehensive tests are all in-scope. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/extractors/confluence/client.go`:
- Around line 152-159: GetPageLabels currently only fetches the first page;
implement cursor-based pagination like GetSpaces/GetPages by looping until there
is no next cursor. Change GetPageLabels to call c.get repeatedly with query
params limit and cursor (or follow the returned _links.next), accumulate
resp.Results into a single slice, and update the local resp struct to include
the pagination metadata (e.g., a Links or _links.next field) so you can extract
the next cursor; ensure errors from c.get are wrapped as before and return the
full aggregated []Label when done.

In `@plugins/extractors/confluence/confluence_test.go`:
- Around line 149-155: The test currently loops over records := emitter.Get()
and only asserts each found "space_key" is not "ARCHIVE", which silently passes
if no records are emitted; update the test to first assert that records is not
empty (e.g., assert.NotEmpty or assert.Greater(len(records), 0) on records
returned by emitter.Get()), then iterate the records from emitter.Get() and
assert that at least one record's props["space_key"] is present and not equal to
"ARCHIVE" (set a found flag while inspecting r.Entity().GetProperties().AsMap()
and assert the flag is true). Ensure you reference the same symbols
(emitter.Get(), records, r.Entity().GetProperties().AsMap(), "space_key") when
making these assertions so the test fails if extraction or filtering removes all
spaces.

In `@plugins/extractors/confluence/confluence.go`:
- Around line 90-95: The current code ignores errors from the emit callback and
downgrades e.extractPages failures to warnings; update the logic so emit
failures and page extraction errors are propagated up instead of suppressed:
check the return value from emit(e.buildSpaceRecord(space)) and return that
error if non-nil, and if e.extractPages(ctx, emit, space) returns an error
return it (don’t just log a warning). Apply the same change where pages are
emitted (the other occurrence noted around line 114) so both emit calls and all
e.extractPages failures bubble up to the caller.
- Around line 148-155: The timestamp formatting in the props map uses the
literal layout "2006-01-02T15:04:05Z" for page.CreatedAt and
page.Version.CreatedAt which forces a literal 'Z' instead of emitting proper
timezone offsets; update those calls to use time.RFC3339 (e.g.,
page.CreatedAt.Format(time.RFC3339) and
page.Version.CreatedAt.Format(time.RFC3339)) and add the missing import "time".
Ensure the changes are applied where props is constructed (referencing
page.CreatedAt and page.Version.CreatedAt) so timestamps include correct
timezone information.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e3d9b77-cb68-48d4-b786-32d81029a1cd

📥 Commits

Reviewing files that changed from the base of the PR and between 84a63a2 and fca4a24.

📒 Files selected for processing (6)
  • plugins/extractors/confluence/README.md
  • plugins/extractors/confluence/client.go
  • plugins/extractors/confluence/confluence.go
  • plugins/extractors/confluence/confluence_test.go
  • plugins/extractors/populate.go
  • test/e2e/confluence_file/confluence_file_test.go

Comment thread plugins/extractors/confluence/client.go Outdated
Comment thread plugins/extractors/confluence/confluence_test.go Outdated
Comment on lines +90 to +95
		emit(e.buildSpaceRecord(space))

		if err := e.extractPages(ctx, emit, space); err != nil {
			e.logger.Warn("failed to extract pages from space, skipping",
				"space", space.Key, "error", err)
		}

@coderabbitai coderabbitai Bot Apr 18, 2026

⚠️ Potential issue | 🟠 Major

Propagate emit and page extraction failures.

Line 90 and Line 114 ignore emitter failures, and Lines 92-95 downgrade page extraction failures to a warning. That can make a run succeed with missing records or failed downstream writes.

Proposed error propagation
-		emit(e.buildSpaceRecord(space))
+		if err := emit(e.buildSpaceRecord(space)); err != nil {
+			return fmt.Errorf("emit space %s: %w", space.Key, err)
+		}
 
 		if err := e.extractPages(ctx, emit, space); err != nil {
-			e.logger.Warn("failed to extract pages from space, skipping",
-				"space", space.Key, "error", err)
+			return fmt.Errorf("extract pages from space %s: %w", space.Key, err)
 		}
-		emit(e.buildPageRecord(page, space, spaceURN, labels))
+		if err := emit(e.buildPageRecord(page, space, spaceURN, labels)); err != nil {
+			return fmt.Errorf("emit page %s: %w", page.ID, err)
+		}

Also applies to: 114-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/extractors/confluence/confluence.go` around lines 90 - 95, The
current code ignores errors from the emit callback and downgrades e.extractPages
failures to warnings; update the logic so emit failures and page extraction
errors are propagated up instead of suppressed: check the return value from
emit(e.buildSpaceRecord(space)) and return that error if non-nil, and if
e.extractPages(ctx, emit, space) returns an error return it (don’t just log a
warning). Apply the same change where pages are emitted (the other occurrence
noted around line 114) so both emit calls and all e.extractPages failures bubble
up to the caller.

Member Author

Not applicable — plugins.Emit is defined as func(models.Record) with no return value, so there's no error to propagate. The warn-and-skip pattern for page extraction matches other extractors (e.g. GitHub).

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the Emit type definition
rg -n "type Emit" --type go
# Check GitHub extractor for warn-and-skip pattern
rg -n "Warn.*extract.*skip\|Warn.*pages\|extractPages" plugins/extractors/github/ -A2 2>/dev/null || true
rg -n "logger.Warn.*skip" plugins/extractors/ --type go -A2 | head -40

Length of output: 2553


@ravisuhag, you're absolutely right — confirmed on both points:

  1. plugins.Emit signature: plugins/plugin.go Line 19 defines it as type Emit func(models.Record) — no return value, so there's nothing to propagate.
  2. Warn-and-skip pattern: The GitHub extractor uses the same pattern in multiple places (e.g., Lines 151, 256, 331, 345, 386, 391, 451 of plugins/extractors/github/github.go), making the Confluence extractor's approach consistent with the established convention.

My review comment was incorrect. Sorry for the noise!
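
The resolution of this thread can be illustrated with a reduced sketch. `Record`, `Emit`, `Space`, and the placeholder URNs below are simplified stand-ins rather than the real `models.Record` and `plugins.Emit`; the only property carried over from the thread is that the emit callback has no return value, so the only error left to handle is the one from page extraction, which is logged and skipped.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Record and Emit are simplified stand-ins. As confirmed in the thread, the
// real emit callback returns nothing, so there is no error to check.
type Record struct{ URN string }
type Emit func(Record)

// Space is a hypothetical minimal space type.
type Space struct{ Key string }

var errUnavailable = errors.New("space unavailable")

// demoExtractPages is a fake page extractor: it fails for the "BAD" space
// and emits one placeholder page record for every other space.
func demoExtractPages(s Space, emit Emit) error {
	if s.Key == "BAD" {
		return errUnavailable
	}
	emit(Record{URN: "page:" + s.Key})
	return nil
}

// extractAll emits each space, then its pages. A failure for one space is
// logged and skipped so the remaining spaces are still processed. It returns
// the number of spaces whose pages were skipped.
func extractAll(spaces []Space, emit Emit) int {
	skipped := 0
	for _, s := range spaces {
		emit(Record{URN: "space:" + s.Key})
		if err := demoExtractPages(s, emit); err != nil {
			log.Printf("failed to extract pages from space %s, skipping: %v", s.Key, err)
			skipped++
			continue
		}
	}
	return skipped
}

func main() {
	var records []Record
	emit := func(r Record) { records = append(records, r) }
	skipped := extractAll([]Space{{Key: "ENG"}, {Key: "BAD"}, {Key: "DATA"}}, emit)
	fmt.Println(len(records), skipped)
}
```

The trade-off the reviewer originally raised still applies to this pattern: a run can finish successfully with some pages missing, so the warning log is the only signal that a space was skipped.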


✏️ Learnings added
Learnt from: ravisuhag
Repo: raystack/meteor PR: 515
File: plugins/extractors/confluence/confluence.go:91-96
Timestamp: 2026-04-18T22:46:54.415Z
Learning: In the raystack/meteor codebase, `plugins.Emit` is defined as `func(models.Record)` (no return value) in `plugins/plugin.go`. There is no error to propagate from emit calls. The warn-and-skip pattern (`e.logger.Warn(...)` + `continue`) for non-fatal per-item errors in extractors is the established convention, as used in `plugins/extractors/github/github.go` and others.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: raystack/meteor PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-18T03:36:16.367Z
Learning: Applies to plugins/extractors/**/*.go : Extractors should emit Records containing Entity (urn, type, name, description, source, properties) and Edges (source_urn, target_urn, type, source, properties) for relationships

Comment thread plugins/extractors/confluence/confluence.go
- Paginate GetPageLabels to capture all labels (not just first page)
- Use time.RFC3339 for proper timezone handling in timestamps
- Tighten exclusion test to assert ENG space exists (not just absence of ARCHIVE)
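
The tightened exclusion check described in the last item can be sketched as a plain function. `checkExclusion` is a hypothetical helper, not the assertion code from `confluence_test.go`; it shows why checking only for the absence of `ARCHIVE` is insufficient: an empty result would pass that check.

```go
package main

import (
	"errors"
	"fmt"
)

// checkExclusion fails when nothing was emitted, when the excluded space
// appears, or when the expected space is missing.
func checkExclusion(spaceKeys []string, want, excluded string) error {
	if len(spaceKeys) == 0 {
		return errors.New("no records emitted")
	}
	found := false
	for _, k := range spaceKeys {
		if k == excluded {
			return fmt.Errorf("excluded space %q was emitted", excluded)
		}
		if k == want {
			found = true
		}
	}
	if !found {
		return fmt.Errorf("expected space %q was not emitted", want)
	}
	return nil
}

func main() {
	// An empty result is now an error instead of a silent pass.
	fmt.Println(checkExclusion(nil, "ENG", "ARCHIVE"))
	fmt.Println(checkExclusion([]string{"ENG"}, "ENG", "ARCHIVE"))
}
```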
@ravisuhag ravisuhag merged commit b560a76 into main Apr 18, 2026
55 checks passed
@ravisuhag ravisuhag deleted the feat/confluence-extractor branch April 18, 2026 22:50

Development

Successfully merging this pull request may close these issues.

Add documentation extractor for Confluence and Notion

1 participant